Abstract. We present the program Temanejo, a debugger for task based 
parallelisation models such as StarSs. The challenge in debugging StarSs 
applications lies in the fact that tasks are scheduled at runtime, i.e. 
dynamically, in accordance with the data dependencies between them. Our 
tool assists the programmer in the debugging process by visualising the 
task dependency graph and allowing the scheduling of tasks to be controlled. 

The toolset consists of the library Ayudame, which communicates with 
the StarSs runtime, and the debugger Temanejo, which communicates 
with Ayudame. Temanejo provides a 
graphical user interface with which the application can be analysed and 
controlled. 
Introduction 

Task based programming models have broadened the landscape of parallelisation 
paradigms. In these models, parallelism is not dictated at the level of operations. 
Rather, the programmer indicates data dependencies between parts of the code, 
which become tasks and are then scheduled by an execution framework at runtime. 
This way the program can "react" to internal irregularities, e.g. differences in 
the execution time of each task, and to external conditions, such as hybrid architectures. 

One family of these programming models is the StarSs family [1,2]. The 
"Ss" stands for superscalar and indicates that these models aim at scaling on 
modern supercomputers consisting of millions of processing units. The "Star" in 
the name represents a wildcard which stands for the different implementations 
of this model, such as CellSs, GPUSs, GridSs, OMPSs and SMPSs [2], among others. 
They differ in the supported programming languages, in whether functions 
(respectively subroutines) or other blocks of code are parallelised, and in whether the target 
platform is, for instance, a grid or a node of CPUs or GPUs. 

Debugging a task parallel program differs from debugging an otherwise 
parallelised program, as the parallelisation is determined by data dependencies instead 
of explicit scheduling of tasks and synchronisation between them. Therefore, the 
development of new debugging tools capable of assisting the programmer with 
debugging task parallel programs is a crucial condition for effectively exploiting 
the possibilities of these models. 

We present a debugging toolset for task parallel programs consisting of two 
components: a library called Ayudame, acting as a thin communication layer 
and enabling the debugger to receive information about the program and send 
commands back to it, and the actual debugger Temanejo, which enables the user 
to view the extracted information and issue commands via a highly interactive 
graphical user interface. 

For this work, we used the StarSs implementations SMPSs and OMPSs. Nev- 
ertheless, the basic concepts of how to debug a task parallel program are the same 
regardless of the specific framework. 

In section 1 we elaborate on the idea and further implications of task based 
parallelism. In section 2 we present different strategies and actions a debugger should 
provide and discuss how these features were implemented in the presented debugger 
Temanejo. Finally, in section 3 we summarise the results and give an outlook 
on future enhancements to the debugger. 

1. Task based parallelism 

In contrast to other task based parallel programming models, the focus in the 
StarSs family lies on the data dependencies between parts of the program which 
are defined as tasks. In SMPSs, for example, the programmer marks functions or 
subroutines as "potentially parallel" using a special #pragma syntax. A typical 
function declaration in C looks like this: 

#pragma css task input(a) inout(b) output(c) 
void func(int a, int *b, double *c) 
{ 
    /* function body */ 
} 

css is the identifier of the pragma, the keyword task tells the precompiler 
that the following function declaration is to be treated as an SMPSs task, and the 
keywords input, inout and output specify the data dependencies for this task. 

These pragmas are read by a compiler wrapper (smpss-cc) which embeds the 
program code into a runtime framework. This framework will initialise a previ- 
ously defined number of threads and start the application. One of the threads will 
be the master thread which creates all tasks and takes care of proper initialisation 
and finalisation of the program. Moreover, the master thread will execute all code 
outside of tasks. The tasks will be executed by the other threads, called worker 
threads. 

During runtime the SMPSs framework will assign all calls to functions marked 
by pragmas to a thread for execution depending on the specified dependencies. 
Footnote: css stands for cell superscalar and is a historical remnant of the first versions of SMPSs. 
Figure 1. State transition diagram of a task. The task is always in one of the states "not queued", 
"queued", "running" or "finished". The processing of the successors is not counted as a separate 
state. Each transition to a state triggers an event, indicated by the numbers. 



To achieve this, a task must be in one of four states (see figure 1): "not 
queued", "queued", "running" or "finished". When tasks are created, they are not 
queued if they have pending dependencies, i.e., if they read (input) from a 
memory address which has been declared output (or inout) by another task at an 
earlier time. Otherwise, the task is queued directly after creation. 

When all tasks on which a task depends are finished, the task is queued. A 
queued task can be run at any time by an idling thread. When a task is executed 
by a thread, its state changes to running. After the task has finished, its state is 
finished. 
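
The state transitions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Task, create, run and finish are hypothetical and do not belong to the SMPSs runtime, and only the read-after-write dependency mentioned in the text is tracked.

```python
from enum import Enum


class State(Enum):
    NOT_QUEUED = "not queued"
    QUEUED = "queued"
    RUNNING = "running"
    FINISHED = "finished"


class Task:
    """Hypothetical task record; not the SMPSs data structure."""

    def __init__(self, name, inputs=(), outputs=()):
        self.name = name
        self.inputs = set(inputs)    # addresses read (input or inout)
        self.outputs = set(outputs)  # addresses written (output or inout)
        self.pending = set()         # unfinished tasks this task waits for
        self.successors = []
        self.state = State.NOT_QUEUED


def create(task, last_writer):
    """Register a new task; queue it at once if nothing is pending."""
    for addr in task.inputs:
        producer = last_writer.get(addr)
        if producer is not None and producer.state is not State.FINISHED:
            task.pending.add(producer)
            producer.successors.append(task)
    for addr in task.outputs:
        last_writer[addr] = task
    if not task.pending:
        task.state = State.QUEUED


def run(task):
    assert task.state is State.QUEUED
    task.state = State.RUNNING


def finish(task):
    """Mark the task finished and queue successors that became ready."""
    assert task.state is State.RUNNING
    task.state = State.FINISHED
    for succ in task.successors:
        succ.pending.discard(task)
        if not succ.pending and succ.state is State.NOT_QUEUED:
            succ.state = State.QUEUED


last_writer = {}  # memory address -> most recent task writing to it
producer = Task("producer", outputs={"c"})
consumer = Task("consumer", inputs={"c"})
create(producer, last_writer)
create(consumer, last_writer)
states_after_create = (producer.state, consumer.state)
run(producer)
finish(producer)
states_after_finish = (producer.state, consumer.state)
```

In this sketch the consumer stays "not queued" until the producer has finished, at which point the processing of the successors moves it to "queued".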

The transitions between the states are useful information for the programmer 
(see section 2.2). Furthermore, the state transition diagram (figure 1) indicates 
that there is more than one way to block or prioritise a task. In the case of 
blocking, one can prevent a task from being queued or prevent it from ever being dequeued for 
execution by a thread. While a task is blocked, one can let the other threads 
continue executing tasks or interrupt the program execution entirely. Other 
possibilities for debugging include stopping only a certain function or adding artificial 
dependencies in order to force a certain order of subtree execution. 

Some of these possibilities are exploited by the debugger Temanejo. How 
this was done is explained in the following. An outlook on future features is given 
in section 3. 



2. The debugger Temanejo 



In this section we will investigate the different requirements regarding a debugging 
tool for task based problems and how the debugger Temanejo fulfils them. A 
rather technical introduction to Temanejo is given in the next section. 

For the programmer it is necessary to see (section 2.2) and be able to 
manipulate (sections 2.3.1, 2.3.2 and 2.3.3) the dependencies which lead to the 
conditions for executing tasks, and thus to alter the way the runtime will step through the 
tasks. As additional information, the duration of each task can be displayed 
(section 2.4). 

In contrast to other parallelisation strategies, when using task based models 
the programmer need not (and cannot) know a priori when a certain task is 
executed. At compile time only the conditions for running a task are known. 
Therefore, classical debuggers which step through lines of code sequentially are not well 
suited for debugging task parallel programs. They form a part of the debugging 
process (see section 2.3.4) but need to be complemented by other tools. 



2.1. Technical details 



Temanejo is written in Python and requires Python version 2.6 or newer. 
Furthermore, it imports several modules, the most important being NetworkX [3] for 
the graph data structure and PyGTK [4] for the visualisation and the graphical user 
interface. 

The debugger Temanejo is tightly linked to the small library Ayudame. 
This library serves as a thin communication layer between the runtime and 
Temanejo and has to be preloaded into the application. The SMPSs and OMPSs 
frameworks (and thus indirectly the application) are instrumented with calls to an 
event handler which passes the information to the debugger as a set of 8 
integer values containing an event id, the task id and, depending on the event, one 
or more of the following: the task id of a dependency, the memory address of 
a dependency, the id of the function assigned to the task, a thread id and/or a 
timestamp. At certain crucial points in the program execution, for 
instance initialisation, task generation and task execution, the event function is 
called and reacts accordingly. In most cases, it will simply pass the data (task 
id, thread id, dependencies and so forth) to the attached socket client (in this 
case Temanejo). 

Temanejo uses this information to build an internal data structure repre- 
senting the dependency graph. Each task is a node of this graph and each depen- 
dency an edge. Nodes and edges have properties such as the task status and the 
memory address of the dependency, respectively. 
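
As a minimal sketch of how such a graph could be assembled from incoming events, consider the following Python fragment. The event ids, the order of the fields inside the 8-integer record and the class DependencyGraph are assumptions made for illustration only; the actual Ayudame protocol may differ. Plain dictionaries are used instead of NetworkX to keep the example self-contained.

```python
# Hypothetical event ids and field order; the real Ayudame protocol may differ.
EVENT_ADD_TASK = 1
EVENT_ADD_DEPENDENCY = 2
EVENT_SET_STATUS = 3


class DependencyGraph:
    """Tasks are nodes, dependencies are edges, both carry properties."""

    def __init__(self):
        self.nodes = {}  # task id -> {"function": id, "status": int}
        self.edges = {}  # (producer id, consumer id) -> {"address": int}

    def handle(self, record):
        """Apply one event record of 8 integers to the graph."""
        if len(record) != 8:
            raise ValueError("expected 8 integers per event")
        event_id, task_id, a, b = record[:4]
        if event_id == EVENT_ADD_TASK:
            # a: id of the function assigned to the task
            self.nodes[task_id] = {"function": a, "status": 0}
        elif event_id == EVENT_ADD_DEPENDENCY:
            # a: id of the task depended upon, b: memory address
            self.edges[(a, task_id)] = {"address": b}
        elif event_id == EVENT_SET_STATUS:
            # a: new status code of the task
            self.nodes[task_id]["status"] = a
        else:
            raise ValueError("unknown event id %d" % event_id)


graph = DependencyGraph()
for record in [
    (EVENT_ADD_TASK, 1, 7, 0, 0, 0, 0, 0),
    (EVENT_ADD_TASK, 2, 7, 0, 0, 0, 0, 0),
    (EVENT_ADD_DEPENDENCY, 2, 1, 0x1000, 0, 0, 0, 0),
    (EVENT_SET_STATUS, 1, 3, 0, 0, 0, 0, 0),
]:
    graph.handle(record)
```

Because every record updates the graph immediately, a display built on top of such a structure can be refreshed as soon as an event arrives.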

The information is displayed by Temanejo at the time it is received and 
the user can access the debugging features instantly by pressing buttons or right- 
clicking on nodes. There are four distinct properties available for displaying in- 
formation: the node colour, the node-margin colour, the node shape and the edge 
colour. This way one can produce different representations of the same graph 
depending on what information is selected for visualisation at a given debugging 
session. 



Figure 2. A simple dependency graph with two independent subtrees consisting of ten tasks each, 
visualised by Temanejo. Eight independent tasks are drawn in blue, two of them scheduled 
(green margin) and six of them queued (yellow margin). Three tasks with input and output 
dependencies are shown in yellow, four reduction tasks in red and two tasks with only one input 
dependency in green. The red margin indicates that a task has unfulfilled dependencies and can 
therefore not be queued yet. The node shapes (triangle and box) denote two different worker 
threads, the circle shape indicates that the task has not been assigned to a thread yet. The text 
labels and colours of the edges indicate the memory addresses of the dependencies. 

In the current version (0.5) Temanejo relies on connecting to the Ayudame 
library as a socket client when starting up. This means that either the parallel 
application must already be running when Temanejo is started or Temanejo will start the 
application at the beginning of the debugging session. 

2.2. Visualisation of the dependency graph 

The first step of data-flow oriented debugging is to visualise the dependency 
graph. Often this already helps the programmer to spot errors or other 
shortcomings of the program. An example of such a dependency graph as displayed by 
Temanejo is shown in figure 2. 

It is important to generate and show the dependency graph while the program 
is running ("online") rather than just collecting the dependency information for 
later review ("offline"). Firstly, the tasks are generated dynamically during 
runtime, hence sometimes the solution to a problem lies in the order in which tasks 
are generated. Secondly, viewing the graph only after running the parallel program 
makes the graph simply the result of a program rather than a tool to debug it. 
Therefore, passing information during runtime from the program to the debugger 
is essential. For SMPSs and OMPSs, this is achieved by linking to the library 
Ayudame. 



2.3. Controlling task execution 

The second requirement for debugging is for the programmer to be able to interact 
with the running application. This means that the programmer can block a task, 
stop the program execution at some point, prioritise a task and attach a gdb 
session to the application. How these features are achieved by Temanejo is 
described in the following. 

2.3.1. Blocking single tasks 

In order to block a specific task, the threads have to be kept from scheduling it 
for execution. At the same time it must be possible to unblock a task again. We 
achieve this by building a "to-be-blocked" list in the runtime from the user input 
to Temanejo. Whenever a thread wants to execute a task, it checks whether the 
task is in that list and, if so, skips it and searches the queues for another one. 
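
The check against the "to-be-blocked" list can be sketched as follows. The names blocked and next_task are hypothetical, and a single queue stands in for the queues of the runtime.

```python
from collections import deque

blocked = set()  # hypothetical "to-be-blocked" list, filled from user input


def next_task(ready):
    """Return the first task in the queue that is not blocked, or None."""
    for task in list(ready):
        if task not in blocked:
            ready.remove(task)
            return task
    return None


ready = deque(["t1", "t2", "t3"])
blocked.add("t1")          # the user blocks t1
first = next_task(ready)   # t1 is skipped, t2 is taken instead
blocked.discard("t1")      # the user unblocks t1 again
second = next_task(ready)  # t1 is now eligible
```

A blocked task remains in its queue in this sketch, so that unblocking it only requires removing its entry from the list.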

2.3.2. Stopping execution when reaching a task 

Stopping the program execution at a certain point means preventing any thread 
from dequeuing tasks. This is implemented by a global flag which is checked by 
an idling thread before looking for a new task in the queues. If the flag is set, the 
thread is sent back to idle. 
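
A minimal sketch of such a flag, assuming a hypothetical helper try_dequeue that an idling worker calls before taking a task; threading.Event is used here merely as a convenient thread-safe boolean.

```python
import threading
from collections import deque

stop_requested = threading.Event()  # hypothetical global flag


def try_dequeue(ready):
    """Called by an idling worker: return a task, or None to stay idle."""
    if stop_requested.is_set():
        return None
    return ready.popleft() if ready else None


ready = deque(["t1", "t2"])
ran = [try_dequeue(ready)]      # t1 is dequeued as usual
stop_requested.set()            # the user stops the execution
ran.append(try_dequeue(ready))  # the worker is sent back to idle
stop_requested.clear()          # the user resumes the execution
ran.append(try_dequeue(ready))  # t2 is dequeued
```

Tasks which are already running are not affected by the flag in this sketch; only the dequeueing of new tasks is suppressed.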

2.3.3. Prioritising tasks 

In order to prioritise a task, we make use of SMPSs' "critical" queue, which is 
also accessible by adding the keyword highpriority to the pragma line before 
the function definition. With a right-click on the node in Temanejo the user can 
prioritise and deprioritise any task before it is queued. At the present stage it is 
not possible to "requeue" a task, i.e. to take a task from one queue and put it 
into another. In future versions this will be possible, which will enable the user 
to prioritise a task which is already queued in order to execute a specific task as 
early as possible. 
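
The effect of a separate "critical" queue can be illustrated with the following sketch. The class Scheduler and its methods are hypothetical, and the assumption that the critical queue is always served before the normal queue is ours; the text does not specify the exact policy of SMPSs.

```python
from collections import deque


class Scheduler:
    """Hypothetical two-queue scheduler; not the SMPSs implementation."""

    def __init__(self):
        self.critical = deque()
        self.normal = deque()
        self.high_priority = set()  # tasks flagged before being queued

    def prioritise(self, task):
        self.high_priority.add(task)

    def deprioritise(self, task):
        self.high_priority.discard(task)

    def enqueue(self, task):
        # The queue is chosen once, here; there is no requeueing afterwards.
        if task in self.high_priority:
            self.critical.append(task)
        else:
            self.normal.append(task)

    def dequeue(self):
        # Assumption: the critical queue is served first.
        if self.critical:
            return self.critical.popleft()
        if self.normal:
            return self.normal.popleft()
        return None


scheduler = Scheduler()
scheduler.prioritise("c")
for task in ["a", "b", "c"]:
    scheduler.enqueue(task)
order = [scheduler.dequeue() for _ in range(3)]
```

Since the queue is chosen at the moment of queueing, a call to prioritise after enqueue has no effect in this sketch, which mirrors the current restriction that tasks cannot be requeued.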

2.3.4. Attaching gdb 

Temanejo is equipped with the possibility of launching gdb at any point of the 
debugging session. By default the terminal user interface of gdb starts and lets 
the user debug any part of the application in the usual line based way. 

Another way to launch gdb is by specifying a function in Temanejo at which 
gdb should set a breakpoint. Again, a gdb process is attached to the application 
process, a breakpoint is set at the specified function and the program execution 
is continued, i.e. the control is given back to Temanejo. When the program 
execution reaches the function, gdb takes control and the user can debug the 
function with gdb. 

2.4. Task duration analysis 

In order to enable the StarSs framework to balance each thread's load, tasks 
must be neither too small nor too big. For SMPSs, the optimal duration of a task 
in order to keep all threads busy while avoiding too much overhead wasted in 
generating, queueing, scheduling and cleaning up tasks was shown to be of the 
order of hundreds of milliseconds. An application with tasks which run less than 
≈ 100 µs will suffer a huge overhead in the sense that the master thread will be 
busy managing the tasks while the worker threads will spend most of their time 
waiting for tasks. On the other hand, when some tasks are too large, the application 
runs into a load balancing problem because the worker thread with the large task 
will again keep other worker threads waiting. 

Therefore, it is important for the programmer to know how long the tasks last. 
In Temanejo the task durations can be shown in colour code for finished tasks. 
As the communication with the debugger alters the collection of time information, 
this can only be a rough guide and does not replace a profiling tool. Nevertheless, 
the ability to display timing data in the dependency graph effortlessly has proved 
to be a powerful tool for programmers. 
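
A simple classification of task durations, as it might underlie such a colour code, is sketched below. The lower threshold of about 100 µs is taken from the text; the criterion for "too long" (ten times the median duration) and the function classify are assumptions made for illustration only.

```python
from statistics import median

TOO_SHORT_US = 100.0    # from the text: about 100 microseconds
TOO_LONG_FACTOR = 10.0  # hypothetical: 10x the median counts as "long"


def classify(durations_us):
    """Map task id -> 'short', 'ok' or 'long' from durations in microseconds."""
    typical = median(durations_us.values())
    result = {}
    for task, duration in durations_us.items():
        if duration < TOO_SHORT_US:
            result[task] = "short"
        elif duration > TOO_LONG_FACTOR * typical:
            result[task] = "long"
        else:
            result[task] = "ok"
    return result


classes = classify({1: 40.0, 2: 500.0, 3: 650.0, 4: 9000.0})
```

Each class could then be mapped to a node colour, so that tasks which are too short or too long stand out in the dependency graph.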

3. Summary and outlook 

We presented Temanejo, a debugger for task parallel applications. The StarSs 
implementations SMPSs and OMPSs, in conjunction with the library Ayudame, 
enable the programmer to write task parallel codes and debug them with 
Temanejo. 

Temanejo is under constant development but already provides powerful 
features which help to debug and enhance task parallel applications. Foremost, 
it shows the dependency graph while it is built up and processed during the 
runtime of the application. The node colour, the node margin colour, the node 
shape and the edge colour are four properties which can each be assigned to display one 
of the following pieces of information: the function which the task executes, the thread 
which executes the task, the status of the task, the duration of each task (all of 
the aforementioned for node properties) and the dependency address (for the 
edge colour). 

The presented version Temanejo 0.5 is already a strong tool for the 
programmer to debug applications parallelised with SMPSs and OMPSs. Nevertheless, 
many further features are desirable in order to make debugging even more convenient. 

In future versions, we will implement more possibilities to analyse the 
dependency graph. Interesting information about the graph includes longest and 
shortest paths, prioritisation recommendations based on this information, and 
displaying (or greying out) subgraphs. This will give the developer deeper insights into 
dependency graphs, which quickly become very complex in real-life applications. 
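
To indicate what a longest-path analysis could look like, the following sketch computes the heaviest chain of dependent tasks in a dependency graph, weighting each task by its duration. It is not part of Temanejo 0.5; the function longest_path and its input format are hypothetical.

```python
def longest_path(durations, edges):
    """Return (length, path) of the heaviest path in an acyclic graph.

    durations: task id -> duration; edges: list of (producer, consumer).
    """
    successors = {task: [] for task in durations}
    indegree = {task: 0 for task in durations}
    for producer, consumer in edges:
        successors[producer].append(consumer)
        indegree[consumer] += 1

    best = dict(durations)  # heaviest path ending at each task
    previous = {task: None for task in durations}
    ready = [task for task in durations if indegree[task] == 0]
    while ready:  # visit tasks in topological order
        u = ready.pop()
        for v in successors[u]:
            if best[u] + durations[v] > best[v]:
                best[v] = best[u] + durations[v]
                previous[v] = u
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    end = max(best, key=best.get)
    length = best[end]
    path = []
    while end is not None:  # walk back along the recorded predecessors
        path.append(end)
        end = previous[end]
    return length, path[::-1]


length, path = longest_path(
    {1: 2, 2: 5, 3: 1, 4: 3},
    [(1, 2), (1, 3), (2, 4), (3, 4)],
)
```

Tasks on such a path are natural candidates for a prioritisation recommendation, since delaying any of them delays the completion of the whole graph.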

Furthermore, we will enhance the control of the application by giving better 
access to queues. This will make it possible to block, prioritise or even run tasks 
at any time. With the intelligent automated use of these features it will also be 
possible to run tasks in a predefined order, e.g. in order of generation, in the same 
order as in a previous run or in any arbitrary order defined in a configuration file. 

Another important feature will be the ability to change the number of threads at any time 
during program execution. This will make it possible to serialise the application 
at any point and return to parallel execution afterwards. 



Also, we plan to enable the user to launch any debugger instead of just gdb. 

Finally, we point out that Temanejo is, in spite of its stable operation and 
rich feature set, still under active development. We encourage the 
reader to use it and send feedback of any kind in order to enhance this powerful 
and unique debugger in the most useful way. 